 validation technique


Stabilizing Machine Learning for Reproducible and Explainable Results: A Novel Validation Approach to Subject-Specific Insights

arXiv.org Machine Learning

Machine Learning is transforming medical research by improving diagnostic accuracy and personalizing treatments. General ML models trained on large datasets identify broad patterns across populations, but their effectiveness is often limited by the diversity of human biology. This has led to interest in subject-specific models that use individual data for more precise predictions. However, these models are costly and challenging to develop. To address this, we propose a novel validation approach that uses a general ML model to ensure reproducible performance and robust feature importance analysis at both group and subject-specific levels. We tested a single Random Forest (RF) model on nine datasets varying in domain, sample size, and demographics. Different validation techniques were applied to evaluate accuracy and feature importance consistency. To introduce variability, we performed up to 400 trials per subject, randomly seeding the ML algorithm for each trial. This generated 400 feature sets per subject, from which we identified top subject-specific features. A group-specific feature importance set was then derived from all subject-specific results. We compared our approach to conventional validation methods in terms of performance and feature importance consistency. Our repeated trials approach, with random seed variation, consistently identified key features at the subject level and improved group-level feature importance analysis using a single general model. Subject-specific models address biological variability but are resource-intensive. Our novel validation technique provides consistent feature importance and improved accuracy within a general ML model, offering a practical and explainable alternative for clinical research.
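The repeated-trials idea in the abstract above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the abstract uses a Random Forest, whereas here a seeded bootstrap resample scored by absolute Pearson correlation stands in for the randomized learner so that the example needs only the standard library. The names `rank_features` and `stable_top_features`, the toy data, and the choice of seeds 0 to 399 are all hypothetical.

```python
import random
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def rank_features(rows, labels, seed):
    """One trial: a seeded bootstrap resample, then features ranked by |correlation|.

    Stand-in (assumption) for fitting a seeded Random Forest and reading
    its feature importances.
    """
    rng = random.Random(seed)
    idx = [rng.randrange(len(rows)) for _ in rows]
    sample_y = [labels[i] for i in idx]
    n_features = len(rows[0])
    scores = [
        abs(pearson([rows[i][j] for i in idx], sample_y))
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def stable_top_features(rows, labels, n_trials=400, top_k=2):
    """Count how often each feature lands in the top_k across seeded trials."""
    counts = Counter()
    for seed in range(n_trials):
        counts.update(rank_features(rows, labels, seed)[:top_k])
    return counts


# Toy data (hypothetical): feature 0 tracks the label, features 1-3 are noise.
data_rng = random.Random(0)
labels = [i % 2 for i in range(60)]
rows = [
    [y + data_rng.gauss(0, 0.3)] + [data_rng.gauss(0, 1) for _ in range(3)]
    for y in labels
]

counts = stable_top_features(rows, labels, n_trials=400, top_k=2)
most_stable = counts.most_common(1)[0][0]
print(counts.most_common())
```

Features that appear in the top set in nearly every trial are the ones that do not depend on a lucky seed; pooling such per-subject counts is one plausible way to arrive at a group-level list.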


A Simple Introduction to Validating and Testing a Model- Part 1

#artificialintelligence

The issues related to the Hold-out validation technique are solved in this technique. Here we make sure that each set has a similar distribution, which will eventually help us generate a better model. Now that we know what these two techniques are, let's have a look at the code. We will be using Python 3.0. Here, df will now hold the dataset that we want to use. We can see that the data has 5 rows and 25 columns, where Survived is our target (dependent) variable and the rest are the independent variables.
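The snippet does not show the article's own code, so the following is only a minimal sketch of one common way to give every split a similar class distribution: stratified K-fold. Whether the article uses exactly this technique is an assumption. The function name `stratified_k_fold` and the 30/70 label list standing in for the Survived column are hypothetical, and the example uses only the standard library.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs whose class mix mirrors the full data."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    # Deal each class's shuffled indices round-robin into the k folds.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)

    # Each fold serves once as the test set; the rest form the training set.
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx


# Hypothetical stand-in for the Survived column: 30 positives, 70 negatives.
labels = [1] * 30 + [0] * 70
splits = list(stratified_k_fold(labels, k=5))
fold_positive_rates = [
    sum(labels[i] for i in test) / len(test) for _, test in splits
]
print(fold_positive_rates)
```

A plain hold-out split can by chance put too few positives in the test set; dealing each class separately into the folds keeps the positive rate of every fold equal to that of the whole dataset.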


Fifty new planets confirmed in machine learning first

#artificialintelligence

Fifty potential planets have had their existence confirmed by a new machine learning algorithm developed by University of Warwick scientists. For the first time, astronomers have used a process based on machine learning, a form of artificial intelligence, to analyse a sample of potential planets and determine which ones are real and which are 'fakes', or false positives, calculating the probability of each candidate to be a true planet. Their results are reported in a new study published in the Monthly Notices of the Royal Astronomical Society, where they also perform the first large scale comparison of such planet validation techniques. Their conclusions make the case for using multiple validation techniques, including their machine learning algorithm, when statistically confirming future exoplanet discoveries. Many exoplanet surveys search through huge amounts of data from telescopes for the signs of planets passing between the telescope and their star, known as transiting. This results in a telltale dip in light from the star that the telescope detects, but it could also be caused by a binary star system, interference from an object in the background, or even slight errors in the camera.


Fifty new planets confirmed in machine learning first

#artificialintelligence

For the first time, astronomers have used a process based on machine learning, a form of artificial intelligence, to analyse a sample of potential planets and determine which ones are real and which are 'fakes', or false positives, calculating the probability of each candidate to be a true planet. Their results are reported in a new study published in the Monthly Notices of the Royal Astronomical Society, where they also perform the first large scale comparison of such planet validation techniques. Their conclusions make the case for using multiple validation techniques, including their machine learning algorithm, when statistically confirming future exoplanet discoveries. Many exoplanet surveys search through huge amounts of data from telescopes for the signs of planets passing between the telescope and their star, known as transiting. This results in a telltale dip in light from the star that the telescope detects, but it could also be caused by a binary star system, interference from an object in the background, or even slight errors in the camera. These false positives can be sifted out in a planetary validation process.


50 new planets confirmed in machine learning first

#artificialintelligence

Fifty potential planets have been confirmed by a new machine learning algorithm developed by University of Warwick scientists. For the first time, astronomers have used a process based on machine learning, a form of artificial intelligence, to analyze a sample of potential planets and determine which ones are real and which are "fakes," or false positives, calculating the probability of each candidate to be a true planet. Their results are reported in a new study published in the Monthly Notices of the Royal Astronomical Society, where they also perform the first large scale comparison of such planet validation techniques. Their conclusions make the case for using multiple validation techniques, including their machine learning algorithm, when statistically confirming future exoplanet discoveries. Many exoplanet surveys search through huge amounts of data from telescopes for the signs of planets passing between the telescope and their star, known as transiting. This results in a telltale dip in light from the star that the telescope detects, but it could also be caused by a binary star system, interference from an object in the background, or even slight errors in the camera.


Validation techniques beyond K-fold

#artificialintelligence

A validation dataset is a sample of data held back from training your model that is used to give an estimate of model skill while tuning the model's hyperparameters. The validation dataset is different from the test dataset that is also held back from the training of the model, but is instead used to give an unbiased estimate of the skill of the final tuned model when comparing or selecting between final models. There is much confusion in applied machine learning about what a validation dataset is exactly and how it differs from a test dataset. Validation techniques in machine learning are used to get the error rate of the ML model, which can be considered as close to the true error rate of the population. If the data volume is large enough to be representative of the population, you may not need the validation techniques.
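The distinction between the validation and test sets described above can be sketched as follows. This is a toy illustration under stated assumptions, not code from the article: the "model" is a one-dimensional threshold classifier, the hyperparameter `alpha` places the threshold between the two class means learned from the training set, and the names `three_way_split`, `fit_class_means`, and `accuracy` are hypothetical.

```python
import random


def three_way_split(n, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle row indices and cut them into train, validation, and test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test


def fit_class_means(xs, ys, idx):
    """'Training': the mean feature value of each class, using training rows only."""
    zeros = [xs[i] for i in idx if ys[i] == 0]
    ones = [xs[i] for i in idx if ys[i] == 1]
    return sum(zeros) / len(zeros), sum(ones) / len(ones)


def accuracy(threshold, xs, ys, idx):
    """Share of rows in idx that 'predict 1 if x > threshold' classifies correctly."""
    return sum((xs[i] > threshold) == (ys[i] == 1) for i in idx) / len(idx)


# Toy data (hypothetical): one noisy feature that tracks a binary label.
rng = random.Random(1)
ys = [i % 2 for i in range(200)]
xs = [y + rng.gauss(0, 0.5) for y in ys]

train, val, test = three_way_split(len(xs))
m0, m1 = fit_class_means(xs, ys, train)

# Tune the hyperparameter alpha on the validation set only.
alphas = [0.3, 0.4, 0.5, 0.6, 0.7]
best_alpha = max(
    alphas, key=lambda a: accuracy(m0 + a * (m1 - m0), xs, ys, val)
)

# The test set is touched once, after tuning, for the final estimate.
test_accuracy = accuracy(m0 + best_alpha * (m1 - m0), xs, ys, test)
print(best_alpha, test_accuracy)
```

Because `best_alpha` was chosen to look good on the validation rows, the validation accuracy is an optimistic estimate; the untouched test rows give the less biased figure.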


Improve Your Model Performance using Cross Validation (in Python / R)

#artificialintelligence

This article was originally published on November 18, 2015 and updated on April 30, 2018. One of the most interesting and challenging things about hackathons is getting a high score on both public and private leaderboards. I have closely monitored the series of Data Hackathons and found an interesting trend. This trend is based on participant rankings on the public and private leaderboards. One thing that stood out was that participants who rank higher on the public leaderboard lose their position after their ranks get validated on the private leaderboard.


Optimizing Prediction Intervals by Tuning Random Forest via Meta-Validation

arXiv.org Machine Learning

Recent studies have shown that tuning prediction models increases prediction accuracy and that Random Forest can be used to construct prediction intervals. However, to the best of our knowledge, no study has investigated the need to, and the manner in which one can, tune Random Forest for optimizing prediction intervals; this paper aims to fill this gap. We explore a tuning approach that combines an effectively exhaustive search with a validation technique on a single Random Forest parameter. This paper investigates which, out of eight validation techniques, are beneficial for tuning, i.e., which automatically choose a Random Forest configuration constructing prediction intervals that are reliable and with a smaller width than the default configuration. Additionally, we present and validate three meta-validation techniques to determine which are beneficial, i.e., those which automatically choose a beneficial validation technique. This study uses data from our industrial partner (Keymind Inc.) and the Tukutuku Research Project, related to post-release defect prediction and Web application effort estimation, respectively. Results from our study indicate that: i) the default configuration is frequently unreliable, ii) most of the validation techniques, including previously successfully adopted ones such as 50/50 holdout and bootstrap, are counterproductive in most of the cases, and iii) the 75/25 holdout meta-validation technique is always beneficial; i.e., it avoids the likely counterproductive effects of validation techniques.


Testing Machine Learning Algorithms with K-Fold Cross Validation - Talend

#artificialintelligence

In an earlier post on Applying Machine Learning to IoT Sensors, I discussed the process for classifying sensor data with a machine learning algorithm. In this post, I'll give a background on choosing an algorithm, then using a validation technique. For the technique, I'll show how to apply it, and how it can be built using the Talend Studio without hand coding. Given a prediction scenario involving a machine learning algorithm, the first question to ask is what is the appropriate machine learning algorithm? Taking the example of predicting a user's activity based on mobile phone accelerometer data, we must be able to classify a category for the data (resting, walking, or running).


Exploiting random projections and sparsity with random forests and gradient boosting methods -- Application to multi-label and multi-output learning, random forest model compression and leveraging input sparsity

arXiv.org Machine Learning

Within machine learning, the supervised learning field aims at modeling the input-output relationship of a system, from past observations of its behavior. Decision trees characterize the input-output relationship through a series of nested if-then-else questions, the testing nodes, leading to a set of predictions, the leaf nodes. Several of such trees are often combined together for state-of-the-art performance: random forest ensembles average the predictions of randomized decision trees trained independently in parallel, while tree boosting ensembles train decision trees sequentially to refine the predictions made by the previous ones. The emergence of new applications requires scalable supervised learning algorithms in terms of computational power and memory space with respect to the number of inputs, outputs, and observations without sacrificing accuracy. In this thesis, we identify three main areas where decision tree methods could be improved for which we provide and evaluate original algorithmic solutions: (i) learning over high dimensional output spaces, (ii) learning with large sample datasets and stringent memory constraints at prediction time and (iii) learning over high dimensional sparse input spaces.